EDA

In this section, we showcase our primary dataset as well as supplementary datasets to get the bigger picture of what data we are working with.

The goal of this section is to explore how we can tentatively use our data in tandem with strategies and techniques found from our literature review in order to profile syndemic relationships for type II diabetes.

Packages

Demographic Data

Because our research goal is to evaluate how social and demographic factors interact with diabetes in a syndemic relationship, it is important to understand the demographic breakdown of the group studied by the National Health and Nutrition Examination Survey (NHANES).

The demographic data set for the study includes 15560 observations of 29 variables including information on race, gender, family income, education level, and language spoken. Names and summaries of each of the variables are shown below.

Missing Values

The data set has 682 missing values for the age variable and 2201 missing values for the ratio_family_income_poverty variable.
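
As a minimal sketch of how per-variable missing-value counts like these can be tallied, the snippet below counts missing entries per column. The toy records and the use of `None` for a missing value are assumptions for illustration; only the column names `age` and `ratio_family_income_poverty` come from the data set.

```python
from collections import Counter

def count_missing(rows, columns):
    """Return {column: number of rows where that column is None or absent}."""
    missing = Counter({col: 0 for col in columns})
    for row in rows:
        for col in columns:
            if row.get(col) is None:
                missing[col] += 1
    return dict(missing)

# Toy records standing in for the demographics table (values are made up).
demo_rows = [
    {"age": 34, "ratio_family_income_poverty": 2.1},
    {"age": None, "ratio_family_income_poverty": 0.8},
    {"age": 61, "ratio_family_income_poverty": None},
    {"age": 12, "ratio_family_income_poverty": None},
]

missing_counts = count_missing(demo_rows, ["age", "ratio_family_income_poverty"])
print(missing_counts)
```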

Distribution of Continuous Variables

There are two continuous variables in the demographics data set: age and the ratio of family income to poverty (ratio_family_income_poverty).

The distribution of ages in the survey is somewhat skewed toward younger participants, with more people under 20 years old than in any other age group.

Note:

The Department of Health and Human Services (HHS) poverty guidelines were used as the poverty measure to calculate this ratio. So, the ratio was calculated as:

Ratio = (Total Annual Income)/(Poverty Guideline specific to each year)
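
The ratio above can be sketched as a small function. The guideline amount used in the example is a made-up placeholder, not an actual HHS figure, and the optional `cap` argument is an assumption for data where the ratio is top-coded at some maximum.

```python
def income_to_poverty_ratio(annual_income, poverty_guideline, cap=None):
    """Ratio of total annual income to the poverty guideline for that year."""
    if poverty_guideline <= 0:
        raise ValueError("poverty_guideline must be positive")
    ratio = annual_income / poverty_guideline
    if cap is not None:
        # Assumption: ratios above the cap are reported as the cap itself.
        ratio = min(ratio, cap)
    return ratio

# Hypothetical guideline of 25,000 for the household's size and survey year.
print(income_to_poverty_ratio(50_000, 25_000))            # twice the guideline
print(income_to_poverty_ratio(200_000, 25_000, cap=5.0))  # limited by the cap
```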

More survey participants fall at five or more times the poverty line than at any other level.

Distribution of Categorical Variables

There are several categorical variables in the demographics data set that may have associations with higher rates of diabetes diagnosis.

The majority of survey participants were born in the United States.

The majority of survey participants have at least a high school diploma, but the numbers are fairly similar across all five levels of education.

Non-Hispanic Black and non-Hispanic White are the two most frequent race categories in the survey sample.

Gender and Education Stratified by Race

To understand how confounding variables may affect our analysis, it is important to compare the distributions of various factors such as gender and education level by other demographic factors such as race.

The gender distribution is fairly even across all races in the survey sample.

Education levels, on the other hand, do differ across races. Therefore, education and race may be confounders in analysis of the survey data.
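
One way to make this kind of stratified comparison concrete is to compute the share of each education level within each race group. The sketch below does this for toy records; the group labels and values are placeholders, and the key names `race` and `education` are assumed rather than taken from the data set.

```python
from collections import Counter, defaultdict

def proportions_within_group(rows, group_key, value_key):
    """For each group, return the share of rows taking each value."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[group_key]][row[value_key]] += 1
    result = {}
    for group, counter in counts.items():
        total = sum(counter.values())
        result[group] = {value: n / total for value, n in counter.items()}
    return result

# Toy records; "A" and "B" are placeholder group labels.
toy_rows = [
    {"race": "A", "education": "High school"},
    {"race": "A", "education": "College"},
    {"race": "A", "education": "College"},
    {"race": "A", "education": "College"},
    {"race": "B", "education": "High school"},
    {"race": "B", "education": "College"},
]

education_by_race = proportions_within_group(toy_rows, "race", "education")
print(education_by_race)
```

Comparing within-group shares rather than raw counts keeps groups of different sizes on the same scale.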

Diabetes Data

The diabetes data set from the National Health and Nutrition Examination Survey contains information on the diagnosis and progression of disease for each participant in the study. This dataset contains 28 variables, which include when participants were diagnosed, whether or not they are on insulin, how frequently they see a doctor, etc. Names and summaries of each of the variables are shown below.

Missing Values

There are missing values in the age_informed, insulin_length, num_dr_visits_past_year, and how_often_glucose_check variables. These missing values are likely for participants who have not been informed of a diabetes diagnosis.

Distribution of Diagnostic Variables

The vast majority of participants in the survey have not been informed of any signs of type II diabetes diagnosis. To understand some characteristics of the survey participants who have been diagnosed, we filtered the data to include only these participants and looked at the distribution of some key variables.
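
The filtering step can be sketched as follows. The flag column `doctor_told_diabetes` and its "Yes"/"No" coding are hypothetical; `age_informed` is the variable named earlier in this section, and the toy values are made up.

```python
from collections import Counter

def diagnosed_only(rows, flag_key="doctor_told_diabetes"):
    """Keep only participants who report having been told they have diabetes."""
    return [row for row in rows if row.get(flag_key) == "Yes"]

def age_decade_counts(rows, age_key="age_informed"):
    """Count participants by the decade of age at which they were informed."""
    counts = Counter()
    for row in rows:
        age = row.get(age_key)
        if age is None:
            continue  # skip participants with no recorded age
        start = (age // 10) * 10
        counts[f"{start}-{start + 9}"] += 1
    return dict(counts)

# Toy records (values are made up).
toy_diabetes = [
    {"doctor_told_diabetes": "Yes", "age_informed": 45},
    {"doctor_told_diabetes": "Yes", "age_informed": 52},
    {"doctor_told_diabetes": "Yes", "age_informed": 58},
    {"doctor_told_diabetes": "Yes", "age_informed": 67},
    {"doctor_told_diabetes": "No", "age_informed": None},
    {"doctor_told_diabetes": "No", "age_informed": None},
]

diagnosed = diagnosed_only(toy_diabetes)
decades = age_decade_counts(diagnosed)
print(len(diagnosed), decades)
```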

The majority of participants who have been diagnosed with type II diabetes were informed of this diagnosis between the ages of 40 and 70.

Among type II diabetics, about one third of participants were taking insulin at the time of the survey.

Health and Nutrition Data

The health and nutritional behavior data details participants’ food choices, such as breastfeeding and other childhood feeding practices, frequency of getting meals prepared away from home, frequency of getting meals from fast food or pizza places, use of convenience foods, and knowledge of the MyPlate program. Names and summaries of variables are shown below. The data represent 15560 individuals with 46 different variables observed.

Column Names: 

1. respondent_sequence_num
2. ever_breastfed_or_fed_breastmilk
3. age_stopped_breastfeeding_days
4. diet_healthiness
5. community_government_meals_delivered
6. eat_meals_at_community_senior_center
7. attend_kindergarten_thru_high_school
8. school_serves_school_lunches
9. school_serves_complete_breakfast_daily
10. summer_program_meal_free_reduced_price
11. meals_not_home_prepared_count
12. meals_from_fast_food_or_pizza_place_count
13. ready_to_eat_foods_past_30_days
14. frozen_meals_pizza_past_30_days

Data Types & Missing Values

Breastfeeding and Weaning

Table of respondents fed breast milk or breastfed:

  Value Frequency Percentage
1   Yes      2066   78.73476
2    No       558   21.26524
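
The percentages in a frequency table like this one follow directly from the counts. A minimal sketch, using the two counts shown above:

```python
def frequency_table(counts):
    """Map each value to (frequency, percentage of the total)."""
    total = sum(counts.values())
    return {value: (n, 100 * n / total) for value, n in counts.items()}

breastfed = frequency_table({"Yes": 2066, "No": 558})
for value, (n, pct) in breastfed.items():
    print(f"{value:>3} {n:>6} {pct:.5f}")
```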

Summary statistics for age stopped breastfeeding (days):

      mean median       sd          min  max
1 198.6769    121 218.0595 5.397605e-79 1095

The reported minimum of 5.397605e-79 is effectively zero and is most likely a floating-point artifact of the XPT import rather than a true measurement.
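
A minimal sketch of how such summary statistics can be computed while skipping missing values. Treating magnitudes below `tiny` as zero is an assumption, added here because file imports sometimes represent zero as a vanishingly small number; the toy values are made up.

```python
import statistics

def summarize(values, tiny=1e-70):
    """Mean, median, sample SD, min and max, ignoring None entries."""
    cleaned = []
    for v in values:
        if v is None:
            continue
        # Assumption: values this close to zero are import artifacts for 0.
        cleaned.append(0.0 if abs(v) < tiny else float(v))
    return {
        "mean": statistics.mean(cleaned),
        "median": statistics.median(cleaned),
        "sd": statistics.stdev(cleaned),  # sample standard deviation (n - 1)
        "min": min(cleaned),
        "max": max(cleaned),
    }

toy_days = [5.397605e-79, 30, 120, 365, None]
summary = summarize(toy_days)
print(summary)
```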

Nutritional Practices

In the distribution of healthiness ratings we see that the most common rating for participants in the survey is “Good” while the ratings “Poor” and “Excellent” are the least common.

The counts of meals not prepared at home, fast food meals, and ready-to-eat meals all have similar distributions: low values (0-1) are the most frequent and higher values are the least frequent.

Social Meal Support

Many survey participants attend schools where lunch and breakfast are served daily, and the majority of survey participants are not receiving meals from the community/government or free/reduced-price meals at summer programs.

Education

Table of respondents who attended kindergarten through high school:

  Value Frequency Percentage
1   Yes      3849   83.63755
2    No       753   16.36245

Laboratory Data

The laboratory data comprise 43 XPT files taken from the NHANES website. With so many files, the cleaned dataset contains 337 columns. Many of these are strongly correlated with each other, since some variables are the same measurement expressed in a different unit. Given how many files are combined and how many variables each file contains, we did not manually remove these highly correlated columns. Additionally, after joining the files on the common Respondent Sequence ID number, many missing values exist in each row as a by-product of the combining process.
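
To illustrate why combining files on a shared ID introduces missing values, here is a minimal sketch of an outer join over toy tables. Whether the actual pipeline used an outer join is an assumption; the lab column names and values are hypothetical, and `respondent_sequence_num` is the ID column name used elsewhere in this section.

```python
def outer_join_on_id(tables, id_key="respondent_sequence_num"):
    """Combine several lists of row dicts into one row per respondent ID."""
    # Collect every non-ID column name, preserving first-seen order.
    columns = []
    for table in tables:
        for row in table:
            for name in row:
                if name != id_key and name not in columns:
                    columns.append(name)

    merged = {}
    for table in tables:
        for row in table:
            rid = row[id_key]
            # A new respondent starts with every column missing (None).
            record = merged.setdefault(rid, {id_key: rid, **dict.fromkeys(columns)})
            record.update(row)
    return [merged[rid] for rid in sorted(merged)]

# Toy stand-ins for two lab files (values are made up).
albumin_file = [
    {"respondent_sequence_num": 1, "albumin_urine": 4.1},
    {"respondent_sequence_num": 2, "albumin_urine": 9.8},
]
cholesterol_file = [
    {"respondent_sequence_num": 2, "total_cholesterol": 180},
    {"respondent_sequence_num": 3, "total_cholesterol": 210},
]

combined = outer_join_on_id([albumin_file, cholesterol_file])
for row in combined:
    print(row)
```

Respondents 1 and 3 each appear in only one toy file, so their rows carry a missing value for the other file's column.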

The cleaning process removed rows where every column except the first is NaN, as well as columns containing only one unique value. Below is a summary of the dataset, along with visualizations of a few of the many variables that we will consider in this project.
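
The two cleaning rules can be sketched as follows. This is an illustration under stated assumptions, not the code used for the project: missing values are taken to be `None` or NaN, they are not counted as a "unique value" when checking columns, and the lab column names are hypothetical.

```python
import math

def is_missing(value):
    """True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def clean_lab_rows(rows, id_key="respondent_sequence_num"):
    # 1. Drop rows where every column except the ID is missing.
    kept = [
        row for row in rows
        if any(not is_missing(v) for k, v in row.items() if k != id_key)
    ]
    # 2. Drop columns with exactly one distinct non-missing value.
    columns = [k for k in (kept[0] if kept else {}) if k != id_key]
    constant = {
        col for col in columns
        if len({row[col] for row in kept if not is_missing(row[col])}) == 1
    }
    return [{k: v for k, v in row.items() if k not in constant} for row in kept]

nan = math.nan
toy_lab = [
    {"respondent_sequence_num": 1, "albumin": 4.2, "units": "mg/dL", "arsenic": 7.3},
    {"respondent_sequence_num": 2, "albumin": nan, "units": nan, "arsenic": nan},
    {"respondent_sequence_num": 3, "albumin": 3.9, "units": "mg/dL", "arsenic": 5.1},
    {"respondent_sequence_num": 4, "albumin": 4.0, "units": "mg/dL", "arsenic": nan},
]

cleaned = clean_lab_rows(toy_lab)
for row in cleaned:
    print(row)
```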

We looked at the distributions of six key analytes measured in the laboratory data: albumin, creatinine, arsenic, triglyceride, cholesterol, and hemoglobin.

Albumin in Urine (ug/mL) Testing

Creatinine (mg/dL) Testing

Arsenic Total (ug/L) Testing

Triglyceride (mg/dL) Testing

Total Cholesterol (mg/dL) Testing

Hemoglobin (g/dL) Testing

Questionnaire Data

The NHANES questionnaire data set has over 40 different variables regarding social behaviours, employment status, mental health, physical health, insurance coverage, and more. For the sake of a comprehensive analysis, we selected five factors to explore further: alcohol consumption, depression, health access, insurance, and occupation. These factors interested us the most and/or were mentioned frequently in our review of the literature.

Each of these factors is a sub-data set with multiple variables. These variables can be a general overview of the topic, such as the alcohol data set’s question “have you ever consumed alcohol,” or quite specific, such as the alcohol data set’s question “how many days have you consumed 12+ drinks in the past year.” We selected the variables that we believed were most representative of each subject and could give us the best overview of the respondent’s behaviour without going into the specifics of each question. We selected 1-4 variables per sub-topic, and these are the variables on which we performed EDA and survival analysis. This selection does not imply that the other variables are less important; rather, we want to focus on the variables that provide the most information.

Alcohol Data

For the alcohol data set, there were many questions about specific alcohol consumption behaviours, such as the questions mentioned above. We have selected two variables for this section: ever_had_a_drink_of_any_kind_of_alcohol and avg_alcoholic_drinks_per_day_past_12_months. We have cleaned the data to remove any outliers and then performed exploratory data analysis.

General Alcohol Consumption

The analysis of the respondents’ answer to “ever had a drink of any kind of alcohol” shows that 89.6% of respondents have consumed alcohol before, and 10.4% have not. We next want to explore how much those who have consumed alcohol tend to consume on average.

How Much Alcohol Consumed Per Day

We find that a majority of the respondents report having between 1 and 3 drinks per day. Those three groups (1, 2, and 3 drinks) encompass approximately 80% of the data. Individuals reporting 4-6 drinks per day make up another 10% of the results, and those reporting 7+ drinks per day make up the remaining 10%.
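
A grouping like this one can be sketched by binning the reported drinks per day and computing each group's share. The toy responses below are made up and only illustrate the mechanics; the cut points (1, 2, 3, 4-6, 7+) follow the groups described above.

```python
from collections import Counter

def drink_group(drinks_per_day):
    """Label a daily drink count as '1', '2', '3', '4-6' or '7+'."""
    if drinks_per_day <= 3:
        return str(drinks_per_day)
    if drinks_per_day <= 6:
        return "4-6"
    return "7+"

def group_shares(values):
    """Share of responses falling in each drink group."""
    counts = Counter(drink_group(v) for v in values)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

# Toy responses (made up), one integer per respondent who drinks.
toy_drinks = [1, 1, 2, 2, 2, 3, 3, 3, 5, 8]
shares = group_shares(toy_drinks)
print(shares)
```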

Depression Data

For the depression data, we decided to explore all of the variables in the data set. Each variable formats the question similarly, asking how many days have you felt ___ and has the same set of answer choices: “not at all”, “several days”, and “more than half the days.”

For every question in the depression questionnaire, more than half of respondents answered “not at all.” The “feeling tired or having little energy” and “trouble sleeping or sleeping too much” questions saw higher proportions of “several days” and “more than half the days” responses. “Poor appetite or overeating” had the next-highest proportion of such responses, and the other questions are all fairly even.

Health Insurance Data

We decided to explore two variables from the health insurance data, which are whether or not the respondent is covered by insurance and if so, what kind of insurance do they have.

# A tibble: 4 × 3
  `Covered by Insurance?` Count Proportion
  <chr>                   <int>      <dbl>
1 Yes                     13671   0.879   
2 No                       1852   0.119   
3 Don't know                 29   0.00186 
4 Refused                     8   0.000514

We found that around 87.9% of the respondents were covered by insurance, and 11.9% were not.

# A tibble: 7 × 3
  `Insurance Type`                          No   Yes
  <chr>                                  <int> <int>
1 covered_by_chip                        15389   171
2 covered_by_medi_gap                    15462    98
3 covered_by_medicaid                    11381  4179
4 covered_by_medicare                    12968  2592
5 covered_by_other_government_insurance  14552  1008
6 covered_by_private_insurance            8457  7103
7 covered_by_state_sponsored_health_plan 14623   937

We also found that the most common type of insurance was private insurance, followed by Medicaid, Medicare, and other types of government insurance.
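
The ranking can be checked directly from the "Yes" counts in the table above. The sketch below sorts the insurance types by count and expresses each as a share of the 15560 respondents.

```python
# "Yes" counts copied from the insurance-type table above.
yes_counts = {
    "covered_by_chip": 171,
    "covered_by_medi_gap": 98,
    "covered_by_medicaid": 4179,
    "covered_by_medicare": 2592,
    "covered_by_other_government_insurance": 1008,
    "covered_by_private_insurance": 7103,
    "covered_by_state_sponsored_health_plan": 937,
}
n_respondents = 15560  # each table row's No + Yes counts sum to this total

# Sort insurance types from most to least common.
ranked = sorted(yes_counts.items(), key=lambda item: item[1], reverse=True)
for name, count in ranked:
    print(f"{name:<40} {count:>5} {count / n_respondents:.3f}")
```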

Access to Healthcare and Hospital Usage Data

For the access to health care and hospital usage data, we decided to look at two variables: general health condition, and whether or not respondents have a regular place to go for health care.

The most commonly reported general health condition was good, followed by very good and then excellent.

Most respondents also have a consistent place to go to for health care, such as an urgent care or primary care physician.

Occupation Data

Finally, for the occupation data set we wanted to explore how much people are working and what their job status is.

This graph shows that the vast majority of respondents work 35-40 hours per week; the next most common ranges are 45-50 hours and then 40-45 hours.

A majority of the respondents are working at a job or business, followed by a good proportion of those who are out of work.